Gene prioritization based on random walks with restarts and absorbing states, to define gene sets regulating drug pharmacodynamics from single-cell analyses

Prioritizing genes for their role in drug sensitivity, is an important step in understanding drugs mechanisms of action and discovering new molecular targets for co-treatment. To formalize this problem, we consider two sets of genes X and P respectively composing the gene signature of cell sensitivity at the drug IC50 and the genes involved in its mechanism of action, as well as a protein interaction network (PPIN) containing the products of X and P as nodes. We introduce Genetrank, a method to prioritize the genes in X for their likelihood to regulate the genes in P. Genetrank uses asymmetric random walks with restarts, absorbing states, and a suitable renormalization scheme. Using novel so-called saturation indices, we show that the conjunction of absorbing states and renormalization yields an exploration of the PPIN which is much more progressive than that afforded by random walks with restarts only. Using MINT as underlying network, we apply Genetrank to a predictive gene signature of cancer cells sensitivity to tumor-necrosis-factor-related apoptosis-inducing ligand (TRAIL), performed in single-cells. Our ranking provides biological insights on drug sensitivity and a gene set considerably enriched in genes regulating TRAIL pharmacodynamics when compared to the most significant differentially expressed genes obtained from a statistical analysis framework alone. We also introduce gene expression radars, a visualization tool embedded in MA plots to assess all pairwise interactions at a glance on graphical representations of transcriptomics data. Genetrank is made available in the Structural Bioinformatics Library (https://sbl.inria.fr/doc/Genetrank-user-manual.html). It should prove useful for mining gene sets in conjunction with a signaling pathway, whenever other approaches yield relatively large sets of genes.


Single-cell differential expression analyses
Gene differential expression analyses quantify the changes in gene expression levels between tested experimental conditions. Although gene functions can often be derived from this type of analyses, the associations can be confounded with incidental gene induction. Therefore interpreting differential expression data with gene set enrichment analysis (GSEA) and pathway analysis can be misleading without attentive curation, producing an unfounded link between a gene differential expression and its functional role in the treatment response. Single-cell differential expression analyses have elevated the issue, where gene differential expression between experimental groups is hindered by gene expression variability between cells.
Although cell-to-cell variability in gene expression is typically overlooked in most analyses, we and others have observed that these differences between clonal cells, can impact the overall cell population response to a stimulation [1][2][3]. Originating from stochastic processes such as transcription initiation, cell-to-cell variability gives rise to an equilibrium of co-existing cellular states within an isogenic cell population [4,5]. The different phases of the cell cycles are one illustration of cell states that robustly proportioned resting cell populations [6], but other functional cell states can be phenomenologically evidenced by the fractional response to cytotoxic cancer drugs for example (IC 50 , Emax>0, [7,8]). However, most single-cell technologies are still unable to access meaningful cell information within clonal populations once cell cycle states signatures are regressed out, and important functional cell states remain confounded in gene expression noise [1]. Apart from cell cycle genes, one hypothesis for the undetected (or unmeasurable) differences in seemingly homogeneous cell populations is that they contain cells in a wide variety of possible cell states that predisposed them to a number of responses or functions such as cell death, impairing pathway enrichment analyses. To recover the molecular determinants of clonal cells response to cancer drugs from the measured gene expression variability, we recently designed a same-cell functional pharmacogenomics approach, named fate-seq, that couples prior knowledge on the cell state (predicted drug response) to the transcriptomic profile of the same cell [1]. With our same-cell approach, we could reveal the molecular factors regulating the efficacy of a drug treatment, from differential expression analyses of one sample of isogenic cells with no gene induction. Although genes differential expressions can now be linked to their functional role in drug response using fate-seq, prioritizing genes as best potential targets for co-treatment remains a difficult task.

Diffusion distances
Indeed, gene prioritization and protein function prediction are challenging due to the smallworld nature of interaction networks [9,10], in particular. To go beyond analysis using the direct neighbors of a node or shortest paths, the value of diffusion distances has been recognized long ago.
Based on the correlation between the expression profile and the phenotype, as well as (diffusion) distances, various similarity measures between genes have been studied [11]. The Diffusion State Distance (DSD) was defined as the L1 norm of m walks (RW) of k steps [12]. The DSD was shown to be more effective than shortest-path distance to transfer functional annotations across nodes in protein-protein interaction networks (PPIN). The DSD was further extended to exploit annotations (weights) on edges, and to exploit an augmented graph incorporating specific interactions [13]. To bias the random walk towards certain nodes, a random walk restarting (RWR) at those nodes can be applied, as initially used in the context of internet surfing [14]. The stationary distribution of the strategy, called the page rank, depends on the restart probability vector [15]. In [16], the minimum of the page rank probability between two nodes is used to qualify the mutual affinity of two proteins in a network. RWR were also used to predict drug-target interactions on heterogeneous networks [17], and also for layered/multiplex graphs, so as to combine complementary pieces of information [18]. Diffusion distances have also been used recently [19] to understand how drugs treat diseases, using both a network of physical interactions and a hierarchy of biological functions. However, in the resulting multiscale interactome, the diffusion profiles are computed using a standard random walk with restart.
A related topic is the problem of ranking differentially expressed genes across two conditions [20]. Following the seminal work on gene set enrichment methods [21], the template of gene set enrichment analysis (GSEA) consist of three steps, namely computing an enrichment score for each gene, estimating its statistical significance, and performing a correction for multiple hypothesis testing. Setting aside the issue of correlations between genes, such methods combine feature selection and clustering (one cluster per cell state/condition) [22], but do not address the question of connecting two sets of genes using an interaction network.
Finally, yet another related challenge is pathway enrichment analysis [23]. Given a set of experimentally determined genes and a database of pathways, the goal here is to find pathways whose genes are over-represented in the gene set of interest. When pathways are known exhaustively, such analysis are sufficient to screen gene sets. If this assumption does not hold, finding intermediates between the gene set and the molecules in a pathway becomes mandatory.
Adding to other prioritization methods [24], Genetrank utilizes prior knowledge from functional cell states (transcriptomic profile of a predicted drug response), protein-protein interactions, and the expected target signaling pathway of a drug of interest.

Contributions
We focus on gene prioritization related to a complex phenotype, based on expression profiles from single-cell RNA-seq. Formally, let X be a set of proteins associated with differentially expressed (DE) genes, and P be a set of proteins involved in a signaling pathway. Rather than to identify DE genes in X that are involved in the pathway of interest, our goal is to prioritize genes in X because they have been previously identified to be involved in a pathway of interest. In our case study, the DE genes have been identified for their effect on cell sensitivity to TRAIL [1]. This dataset comprised DE genes that are the results of a drug efficacy screen: it is a gene signature of cell sensitivity at the drug IC 50 . Therefore the significantly DE genes in this screen are potential drug co-targets; they are not known to be drug targets and Genetrank seeks to prioritize them, to suggest the corresponding gene products as co-targets in combination treatments with the drug used in the screen. Hence, we prioritize genes in X given the knowledge of genes in P (which gene products are involved in the drug mechanisms of action), using an underlying PPIN, to pick out the genes having a higher likelihood to regulate the pathway. Previous work on diffusion distances (RW or RWR) has three limitations in this context. First, in using hit vectors or stationary distributions, all nodes of the network contribute to the comparison of two sources. Instead, we wish to focus on nodes in P. Second, RWR use a bias on sources, but in our case, the sets X and P may be considered on an equal footing, which commands analysis in both directions, i.e. from X to P and from P to X. Third, instead of using a single restart rate [18], we study a filtration (sequence of nested sets) of genes retrieved, in tandem with so-called saturation indices revealing accessibility scales in the PPIN.
To accommodate this rationale, we present a novel analysis technique based on random walks with restarts and absorbing states. Recall that in a Markov chain, an absorbing state is a state which is never exited. Since stationary distributions are irrelevant in this context [25], we resort to hitting probabilities for the nodes on the target set P. As we show, doing so yields scores for pairs of genes in S × T and T × TS, from which a ranking of genes in X is defined. As a case study, we use a dataset of differentially expressed genes involved in the regulation of a cancer drugs pharmacodynamics [1].

Goal: Formal statement
Consider two sets of genes X and P respectively composing the predictive gene signature of sensitivity to a drug and the genes involved in its mechanism of action, as well as a protein interaction network (PPIN) containing the products of X and P as nodes. We introduce Genetrank, a method to prioritize the genes in X for their likelihood to regulate the genes in P.

Biological problem
The main goal of our approach is to ameliorate the ranking of gene signatures from differential expression analyses in order to better select the genes whose products are likely to impact the phenotypic response of the cells. Using fate-seq, we focus on cancer cells briefly treated with the TNF-related Apoptosis Inducing Ligand (TRAIL), so for the drug response to be predicted for each cell that is then profiled by single-cell RNA sequencing [1]. In such single-cell analyses performed with isogenic cells treated together, the stable differences in gene expression between cells from the two groups, namely predicted sensitive and predicted resistant, are small and otherwise confounded in gene expression noise. In addition, the short treatment necessary for the cell response prediction does not induce a detectable genomic response (Mendeley data Dataset https://doi.org/10.17632/m289yp5skd.1 [1,26]) that would confound the functional role of the differentially expressed genes between the two groups with gene induction, as it is often the case in other studies.

Sets X and P
The list of differentially expressed genes obtained in this study constitutes our set X [1], S2 Table in S1 File. The criteria used to define the two groups of single-cells compared in the differential expression analyses giving X, is the activation rate of caspase-8. Caspase-8 is a protein of the receptor-mediated apoptosis pathway, which is part of the TRAIL mechanism of action [8]. Therefore, in our case study the set P is a set of target genes, whose products are proteins of the receptor-mediated apoptosis signaling pathway.
The sets X and P are of size 320 and 49, respectively. Altogether, these sets X and P represent a unique dataset to assess the benefit of our approach in ranking genes based on their likelihood to impact drug response.

PPIN
A number of interaction databases coexists, each with specific features, in particular a trade-off between exhaustivity and confidence. We use MINT due in particular to the compliance with the protein naming standards [27].
The PPIN was constructed from interactions downloaded from the MINT website https:// mint.bio.uniroma2.it/. Only proteins with the species label Homo Sapiens were downloaded. The interacting proteins are identified in UniProtKB format. The resulting network is called the MINT network in the sequel. This initial graph, containing 11,672 vertices representing proteins, and 52,839 edges representing protein-protein interactions, is edited as follows.
First, the PPIN being disconnected, we focus on its largest connected component, encompassing 11,427 vertices. Second, we remove all self-interactions. Third, multiple interactions between the same two proteins are collapsed into a single edge. Summarizing, we obtain a graph with 11,427 vertices and 36,526 edges.

Sets X and P within the PPIN
Selected genes in the sets X and P being absent from the largest connected component of the PPIN were removed, finally obtaining sets X and P of size 227 and 41 (from sets of size 320 and 49 initially).

Variations on the set X
In order to evaluate the impact of the size difference of X and P on the scores, we analyse the symmetry H(I) of Eq (7) for instances involving sets X 0 of varying size. Practically, for each s 2 {|P|, 110, 220} we pick N r (= 1000) random subsets of X. We then compare the distributions of H(I) obtained.

Rationale and positioning with respect to previous work
We consider a set X of experimentally determined genes, and a set P of genes belonging to a pathway. We introduce methods using random walks on graphs with two sets of vertices as input, referred to as sources (S) and targets (T). In order to analyze the (lack of) symmetry between paths joining X and P and vice-versa, we use our methods twice: • direction S = X ⇝ T = P: starting from nodes in X to reach nodes in P; • direction S = P ⇝ T = X: starting from nodes in P to reach nodes in X.
To study the relationship between the node sets S and T, our modifications of diffusion based distances rely on absorbing states and a renormalization scheme. These modifications are actually motivated by two structural properties of interaction networks (Fig 1). The set of sources is S = {s 1 , s 2 , s 3 } and the set of targets is T = {t 1 , t 2 , t 3 , t 4 }. When defining absorbing states, transitions corresponding to blue arcs are removed. This implies that t 2 will not be highlighted because t 1 or t 4 must be visited before any random walk starting from any source s 2 S. Furthermore, the normalization aims at reducing the importance of t 3 in the study (due to its high proximity to the three sources). https://doi.org/10.1371/journal.pone.0268956.g001 The first difficulty owes to a notion of subsidiarity amongst targets, meaning that if some vertices of T are neighbors (or very close from one another) in the graph, targets upstream will artificially modify the weights of targets downstream. Indeed, if a target is just after another (important) target, then its weight will be large even in the absence of direct paths from S ( Fig  1, target t 4 ). In this context, absorbing states make it possible to stop the exploration process at such upstream/ancestor nodes, and force direct connexions from sources to subsidiary targets.
The second difficulty owes to the close vicinity of selected targets to sources (Fig 1, target  t 3 ), motivating the introduction of hit scores (Def. 3). For example, if a target is a neighbor of some sources in the graph, these hit scores decrease the weights of such direct paths, stressing the importance of the other non trivial paths. One can note that this normalization can be done with other functions, e.g. a distance-based function that could be the average of the |T| lengths of shortest paths between every source s 2 S and a given target t.
We now formally introduce our methods.

Graphs
To connect X and P, we consider a PPIN whose vertices are the individual molecules, and whose edges represent pairwise interactions. Such a network is modeled by a vertex-weighted edge-weighted directed graph G = (V, E). The weight of any vertex u 2 V is denoted w u and the weight of any edge uv 2 E is denoted w uv . We assume that w u , w uv 2 (0, 1] for every u 2 V and every uv 2 E. In the unweighted case, we set w u = 1 and w uv = 1 for every u 2 V and every uv 2 E. In the undirected case, we have uv 2 E if and only if vu 2 E. Let n = |V| be the number of vertices and let V = {v 1 , . . ., v n }. The set of out-neighbors (resp. in-neighbors) of a vertex u 2 V is denoted N þ G ðuÞ (resp. N À G ðuÞ). To analyze paths between vertices of X and vertices of P, we formalize Markov chains and random walks.

Model.
In the sequel, we consider two sets of vertices S and T from the graph G, with S \ T = ;.
We define a Markov chain for which the set of states is exactly the set of vertices V. The transition matrix M is defined as follows for every pair (u, v) 2 V × TV: Particular cases of this construction are as follows: (1)).
Recall that a state is absorbing if once reached by a walk, it is never exited. Observe that the set of states T is absorbing in the Markov chain defined by transition matrix M. We now define the Markov chain with restart from M and from a subset S 0 � S. Intuitively, for each vertex u 2 V\T which is not a target, we add a transition to every vertex of S 0 Formally, given a real number r 2 [0, 1), the transition matrix M r S 0 is defined as follows for every pair (u, v) 2 V 2 .
Note that restart transitions may have the same origin/destination as an existing transition in M. (Equivalently, in graph theoretical terms, one has two arcs between the same two vertices). In that case, the probabilities specified in (2) should be added. The transition matrix M is a particular case of M r S 0 when r = 0. Note also that when |S 0 | = 1, one restarts to a single vertex.

Definition. 1 (State distribution) Consider an initial distribution uniform in the set S 0 .
Given any r 2 [0, 1) and any S 0 � S, the state distribution at each step i � 0 is denoted ðuÞ ¼¼ 1 jS 0 j for every u 2 S 0 . Under the assumption that, from every u 2 S 0 there exists a path in G to some v 2 T, the limit of this vector exists when i ! 1, for every value of r 2 [0, 1). We denote it with p M r S 0 . The probabilities in this vector are commonly known as hitting probabilities of the target set T.
Using the target set, we define: Definition. 2 (Hit probability vector) Let T = {t 1 , . . ., t k } be the target set. Given any r 2 [0, 1) and any S 0 � S, the hit vector is composed of the hitting probabilities for states of T.
In the following, we abuse the notation writing M r s instead of M r fsg . Furthermore, we will write M r instead of M r S . Finally, we renormalize the vectors with the hit vector of the chain without restart (i.e. r = 0): Definition. 3 (Hit score) Given r 2 [0, 1], define the score from s to t as The hit score vector associated with each source Finally the rank of the score of a pair (s, t)2S × T is defined as follows: Remark 1 The hit score from Eq 3 incorporates three ingredients, namely (i) a random walk with restart, (ii) absorbing states, and (iii) renormalization by the value obtained without restart. We may define other scores by removing any of these ingredients, e.g. the mechanism of absorbing states.

Scores and their symmetry
3.4.1 Scores: X ⇝P versus P ⇝X. In order to analyze the (lack of) symmetry between paths joining X and P and vice-versa, We apply the previous construction to two settings: (PPIN, X, P). Using a different PPIN, a different experimental gene set X or a different pathway gene set P will affect the score Q (r) (x, p) for a given pair (x, p). In order to make it clearer we define an instance of execution as the triplet I = (PPIN, X, P), and we refer to the score obtained under this instance as Q ðrÞ I ðx; pÞ. However, for conciseness, we simply denote this score Q (r) (x, p).

Symmetry at the gene level.
Consider an experimental gene x 2 X and a pathway gene p 2 P. Using Eq (3), we assess the asymmetry using the ratio of log scores

Symmetry at the instance level.
Consider an instance I = (PPIN, X, P). (Because throughout this paper we make use of a single PPIN (MINT) and the same set P, we will use the shorthand (X) for the instance). To study the symmetry at the instance level, we consider the proportion of pairs (x, p) 2 X × P such that Q ðrÞ I ðx; pÞ > Q ðrÞ I ðp; xÞ. Formally, denoting 1 b the indicator function of the boolean b, the symmetry of the results for I is given by

Genetrank, saturation indices, hits
Using the scores in the two directions X ⇝P versus P ⇝X (Eq 5), we now define a ranking on the genes of X: Definition. 4 (Average score) Let 1 � τ � |P| be an integer. Consider a fixed value of the restart rate r. For a source x, let the average score be the arithmetic mean over the top τ values max(log Q (r) (x, p), log Q (r) (p, x)) observed for p 2 P. The gene network ranking (Genetrank) of genes in X is the ranking associated with the aforementioned average values. The set of top k genes of the ranking is denoted T ðr l Þ t;k . Note that when τ = 1, the ranking of a gene in X is determined by its largest max score. Averaging scores over τ targets makes intuitive sense here when identifying whole connectedness to a pathway.
To assess the stability of this ranking, we proceed as follows. Consider a set of values R = {r 1 , . . ., r N }, sorted by increasing or decreasing value. We define the set of genes found in T ðr l Þ t;k up to a given value r l , with 1 � l � N, by We now use this set to qualify the speed at which we discover the sources in X when increasing the upper bound on the restart rate:

Definition. 5 (Saturation indices for an increasing sequence of values of r.)
The saturation index at threshold r l is the fraction of sources present in jT ð!r l Þ t;k j, that is: The relative saturation index is the latter normalized by the value of k used: In the absence of overlap between consecutive T ðr l Þ t;k , one would have jT ð!r l Þ t;k j ¼ k � l. Thus, normalizing by k provides a measure of the overlap between consecutive sets.
We note in passing that the previous sets can be used to define how many hits in a given ref-

Graphical representations with radar scatter plots
We wish to rank genes from X using genes from P, exploiting the directions X ⇝P and P ⇝X.
3.6.1 Score radar plots. The difficulty in working with values LogCount(x), LogFold-Change(x) for x 2 X represented in an MA plot, is that all pairs (x, p) get mapped onto the same point. To get around this difficulty, we associate a radar plot with each point x 2 X, yielding an overall score radar scatter plot. Each gene score radar plot is defined as follows: • the background of the gene radar plot is colored using a heat map indexed on the largest (X ⇝P or P ⇝X) log score observed for that gene. This background color makes it easy to spot the individual radar plots with high scores.
• the gene radar plot has a number of spokes equal to the top k (user defined) scores.
• on each spoke, two values are found, namely the scores log Q (r) (x, p) and log Q (r) (p, x).
• finally, the radar plot title is set set to the gene name accompanied by the interval of scores (log scale).

Score radar MA plot.
Displaying all individual score radar plots in the LogCount(x), LogFoldChange(x) plane yields the so-called Score radar MA plot.

Complexity and implementation
3.7.1 Complexity. The running time of one instance i.e. computing the hit scores at a fixed r, depends on the sizes of the PPIN, and of sets X and P.
In contrast with the classical stationary probabilities of ergodic Markov chains, hit probabilities are dependent on the initial state. Therefore, where a stationary distribution needs to be computed once and occupies the space of a vector of dimension n = |V|-the number of nodes, hit probabilities need to be computed for each initial state envisioned and they occupy a total space of size |X| × |P|. Exact formulas for stationary probabilities and for hit probabilities both involve matrix inversions [28], at a practical cost of order n 3 . They are therefore not practical for large sparse graphs, and iterative methods are usually preferred [29], leading to an approximate result with any desired accuracy. The simplest iterative method actually computes state probability vectors p i M r S 0 (Def. 1) at successive steps until some convergence criterion is met. It applies both to stationary probabilities and to hit probabilities. It produces an �-approximation with a computation time of order C|E|log(1/�), C being a constant related with the spectrum of the graph. For Markov chains with restart probability r, this constant increases with r and we observe this phenomenon in our experiments.
Practically, we compute the hit scores of Def. 3 using the C++ Marmote library ( [30] and https://wiki.inria.fr/MARMOTE/Welcome). Specifically, we use from Marmote the iterative method which approximates the result with iterations of order m, with m = |E| the number of edges. It is faster and also more stable in practice, since it involves only positive numbers. The stopping criterion chosen for iterations is that the L 1 distance between successive iterates is less than 10 −6 .
Practically, processing one instance took a few minutes (<5) worst-case, on a standard desktop computer (Precision 7920 Tower, 64 Go of RAM, Intel(R) Xeon(R) Silver 4214 CPU at 2.20GHz; OS: Fedora Core 32).

Contenders.
As already noticed (Rmk. 1), the hit score from Eq 3 incorporates three ingredients, namely (i) a random walk with restart, (ii) absorbing states, and (iii) renormalization by the value obtained without restart. To assess the importance of the latter two ingredients, three contenders of nested complexity are considered in the sequel: • pr-affinity: the minimum page rank affinity introduced in [16], using a plain random walk with restart model.
• Genetrank-renorm: score obtained from a random walk with restart and renormalization using Eq 3-but no absorbing state.
• Genetrank-AS: score obtained from a random walk with restart and absorbing statesbut without the renormalization of Eq 3.
• Genetrank: score obtained using all ingredients: random walk with restart, absorbing states, and the renormalization of Eq 3.

Genetrank and saturation indices
The saturation index makes it possible to study the variation of the size of the set of genes selected by the three contenders when r increases or decreases. To promote the vicinity of sources in the PPIN, we process values r by decreasing value (Rmk. 2). We inspect in turn the saturation, relative saturation and number of hits (Figs 2 and 3, S4, S5 Figs in S1 File). We use the median value τ = 20 to compute average scores-see the Supplemental for the values τ = 1 and τ = 40. The methods pr-affinity, Genetrank-renorm, and Genetrank-AS yield a very similar behavior in two respects. First, the saturation gets maximum (one) for large values of k, and is relatively insensitive to the value of r (Fig 2(B), S4(B), S5(B) Figs in S1 File). Also, the relative saturation drops down to very small values rapidly (Fig 2(E), S4(E), S5(E) Figs in S1 File). In sharp contrast, the maximum saturation yielded by Genetrank is equal to �0.4, and shows a marked dependence on the restart rate (Fig 3(B)). The relative saturation is also much more sensitive to the value of k used, with a coherent behavior as a function of k at fixed values of r (Fig 3(E) and 3(F)).
These observations stress the specificity yielded by the combination of renormalization and absorbing states in Genetrank. To further these insights, we proceed with a more detailed analysis of the incidence of the parameters r and τ: • (r) The restarting rate r defines the bias towards sources. Whatever the value of k, in processing values of r in decreasing values, we observe that the saturation increases when r approaches zero (Fig 3(C)). The slope of the curves increases for values of r � 0.1, showing that for such values of the restart rate, the random walks get to explore a larger region of the PPIN. In Genetrank, larger values of the restart rate are therefore instrumental in promoting a more specific exploration. • (τ) Increasing τ consists of averaging on more targets. This averaging yields a marked increase of the saturation (whatever the value of k), especially for small values of r (Fig 3, S6, S7 Figs in S1 File).
Let us now consider the relative saturation (Fig 3(D)-3(F); S6(D)-S6(F), S7(D)-S7(F) Figs in S1 File). Sections of the surface at r = cst yield a monotonic behavior, that is the smaller the restart rate the larger the relative saturation. (We also note that for the first value of r processed, that is r = 0, 8, the relative saturation is always equal to 1/|X|�0.0044.) Interestingly, slices at r = cst are not monotonic. The valleys crossed on such slices show that increasing k increases the saturation but not necessarily the relative saturation. This may be interpreted as accessibility scales in the PPIN.

Evaluating the effect of varying restart rates on scores, using experimentally validated gene hits
Consider the list of reference genes (and their products) that had been experimentally validated following the single-cell functional genomics approach using predictive cell dynamics [1] O15304 (SIVA1), P78537 (BLOC1S1), P0CW18 (PRSS56), P53007 (SLC25A1), Q9Y2X8 (UBE2D4), DNM1L(O00429). Note that Q6UW78(UQCC3) cannot be considered here, as it is not present in our PPIN of interest. We compute the number of genes from this list found in the sets T ðrÞ t;k and T ð!r l Þ t;k . We compare the results of the four methods (Figs 4 and 5, S8, S9 Figs in S1 File). Consistent with the analysis in the previous section, the methods pr-affinity and Genetrankrenorm are rather non specific, with a number of hits yielded by T ðrÞ t;k essentially constant whatever the value of r at a fixed value of k, whatever the value of τ (S4, S8 Figs in S1 File). The method Genetrank-AS shows a more contrived behavior, with the same number of genes uncovered (i.e. 5) for small values of r, yet more variations when varying r at fixed values of k (S9 Fig in S1 File).
The method Genetrank goes one step beyond, namely discovers genes more selectively when increasing k and changing the restart rate (Fig 5). Small values of τ allow retrieving three genes using fewer values of r, clearly showing the specificity/robustness of nodes of X highly ranked with a large restart rate. Conversely, using larger values of τ in conjunction with smaller values of r (larger excursions in the PPIN) allows reporting four genes in total. Except for τ = 1, using large values of r requires larger values of k to retrieve the known genes: for τ = 20 and τ = 41, a single gene is retrieved when r > 0.5.
These experiments show the progressiveness yielded by Genetrank is exploring the PPIN, as opposed to avoid the fast mixing observed in pr-affinity and Genetrankrenorm, and to a lesser extent Genetrank-AS. Methods without absorbing states see the whole PPIN in a more homogeneous way, due to its small-world nature. This behaviour is even more pronounced in our case, as the target genes form a pathway. Indeed, a random walk reaching one such node is likely to discover its neighbors right after (See Comparison to previous work, Sec. 3.3.) When a absorbing states are used, the random walks halts, and the discovery of neighbors does not take place. The introduction of absorbing states therefore appears crucial to localize the exploration of connexions, in conjunction the choice of the restart rate r -generally taken to r = 0.7 in previous work -see [18] and citations therein.

Symmetry analysis on a per-source basis: Radar plots
The symmetry ratio H(I) is a global assessment based on all pairs in the Cartesian product X × TP. For an assessment of the asymmetry on a per gene basis, we resort to radar plots (Sec. 3.6).
Example radar plots for genes experimentally validated are provided are provided on Fig 6). The individual radar plots can be assembled in a global MA plot (Fig 7).

Biological analysis 4.4.1 Differentially expressed genes.
The set of differentially expressed (DE) genes obtained from the single-cell multimodal profiling approach fate-seq, allows comparing transcriptomic profiles between single cells of an isogenic population, grouped by their predicted drug response [1]. Practically, the DE genes have been identified for their effect on cell sensitivity at the drug IC 50 , and represent the signature of drug efficacy. (The DE genes here, are potential combination targets to increase the treatment efficacy.) Although this response prediction allowed the differential analysis using the edgeR likelihood ratio test framework [32] by defining two groups (predicted sensitive vs. tolerant cells), the most significant DE genes (false discovery rate FDR < 0.1 and |log 2(FC)| > 2) still constitute a list of more than 60 genes. Such a large list defies the expected practical set of potential targets that would serve to design co-therapeutic strategies increasing overall treatment efficacy. In addition, ranking only on differential expression might underestimate a gene for its potential function as regulator of the pathway overall activity. As an example, among these gene hits, we have observed that, at equal distance (the shortest path between the source and the target gene caspase-8), the noisier the gene expression was, the larger the effect a gene perturbation had on cell death, and at comparable expression variability: the longer the shortest path, the larger the effect [1]. We reasoned that diffusion distances could also ameliorate ranking of cell-to-cell differential expression analyses. We use Genetrank to sort genes for their connectedness to molecular factors regulating the signaling pathway triggered by the drug of interest (TRAIL). We intersect the list of the 65 most significant DE genes obtained by edgeR, with the set of genes obtained with Genetrank (k = 50, range of values of r: 0.0.8, τ = 41; Eq (8) and S7 Fig in S1 File), resulting in 18 genes (S1 Table in S1 File).
This gene shortlist presents a number of valuable advantages over the list of significant DE genes, as it becomes more manageable for experimental validation, and drug target discovery. In addition we found that this shortlist contained genes that had been previously reported to regulate cell death and importantly, it was enriched in genes that we had experimentally validated for having an effect on TRAIL response.

Previously validated genes.
Out of the 18 genes (S1 Table in S1 File) we prioritized for their likelihood of having an effect on the drug mechanism of action (MoA) using Genetrank, we found 4 genes that we had previously validated experimentally, namely BLOC1S1, DNM1L, UBE2D4, SLC25A1. This result indicates that we successfully enriched the list with gene hits that have functional relevance as co-drug targets. Indeed, among the target genes shortlisted solely based on statistical criteria (differential expression analyses between predicted resistant and predicted sensitive cells to TRAIL), only DNM1L, had previous reports on TRAIL sensitivity. Indeed, DNM1L encoding the dynamin-1-like protein DRP1, involved in mitochondrial division and apoptosis, has been reported to increase TRAIL sensitivity regardless of the co-treatment strategy (recruiting DRP1 at mitochondrial membrane, or inhibiting DRP1), which lead the authors to suggest that DRP1 might not be the target of mdivi-1, its originally reported inhibitor [33,34]. In our experimental screen, cells that were predicted resistant to TRAIL based on low caspase-8 early activation rates, showed increased DNM1L expression levels [1]. Also in this recent study, we could show that DNM1L over expression reduced caspase-8 activation and cell death in response to TRAIL, and consistently, that DRP1 or dynamins inhibition (using mdivi-1 or dynasore), both increase caspase-8 activation and cell death. All together, these results suggest that in addition to the pro-apoptotic role of DRP1 at the mitochondrial membrane, DRP1 could play an anti-apoptotic role at the receptor level on caspase-8, further validating the approach presented here and the relevance of Genetrank.

Novel genes.
Genetrank also puts forward some genes that were initially further down the list and therefore in an unfavorable position to command gene validation or functional studies. JADE1 for example, has been shown to promote apoptosis in renal cancer cells [35], and MUL1 [36,37], which should motivate further experimental investigations.

Method
This work presents a gene prioritization method, Genetrank, which can be coupled with single-cell functional genomics approaches to rank a set of genes X for their likelihood to regulate the cell signaling pathway P, which is targeted by the drug of interest (i.e. rank genes for their impact on drug sensitivity). While diffusion based distances have been used for several problems in interaction network analysis, the Genetrank introduces several refinements. The first one is to use random walks with restarts and absorbing states to focus on certain nodes of the PPIN. The second one is to exploit the asymmetry of random walks from X to P and P to X across the PPIN. The third one is to use a whole set of restart rates to define a filtration (sequence of nested sets) of genes, using saturation indices. The analysis yielded by saturation indices shows the ability of Genetrank to progressively unveil regions of the PPIN, thanks to absorbing states and our renormalization scheme. Instead, classical methods based on random walks (with or without restarts) see the whole network in a more homogeneous fashion due to its small-world nature. In this context, absorbing states force the evaluation of direct connexions between sources and targets, and the renormalization scheme makes it possible to tone down large weights due to the proximity between sources and selected targets. Altogether, these modifications allow Genetrank at various restart rates to progressively explore the network, and incrementally investigate interactions within a pathway. For gene prioritization, our novel ingredients make it possible to perform a delicate study of the interplay operating between the different parameters defining the RW, providing a stratification of genes of X according to their proximity to genes in P.

Biology
As a case study, we used Genetrank with a TRAIL-sensitivity gene signature obtained from the single-cell functional genomics approach using predictive cell dynamics called fate-seq [1]. Here, we show that we could enrich the most significant differentially expressed genes between predicted sensitive and resistant cells, with genes that were experimentally validated for increasing drug treatment efficacy (or previously described as doing so).
The nature and design of large transcriptomic profiling experiments (single-cell and bulk) and their analyses, impose a number of limits on the biological interpretation of gene expression analyses, especially in the context of understanding the determinants of drug sensitivity. Firstly, the differential analyses between cell types of varying drug sensitivities can be confounded with bystander genes (regarding their role in the drug MoA). Secondly, within a cell type, differential analyses between treated versus untreated samples lacks specificity over genes at the origin of the cell response versus the genes induced by the drug in resistant cells (not to mention that sensitive cells are rarely recovered in experiments). Moreover, the subsequent analyses such as gene set enrichment analysis and pathway-based analyses [21,38,39], dependent on prior knowledge of the gene functions and their interactions, often determined with the aforementioned experimental designs. Gene annotations themselves (from Gene Ontology (GO) database for example) might hinder gene-based drug sensitivity predictions, by introducing biases related to errors or incompleteness due to unknown function, protein moonlighting [40], or technical and biological inherent limits [41,42]. Yet, gene expression remain a central piece of data in drug sensitivity prediction [43,44], providing successful use with cancer cell lines to discover gene involved in drug resistance [45] with computational methods using prior knowledge [46][47][48][49][50]. Although some analyses performed these tasks on basal gene expression [24], which aim at harnessing cell states at the origin of drug response (as opposed to drug-perturbed gene expression studies), all pursue cell lines comparative profiling.
However, profiling cell lines remains ineffective with respect to determining the molecular factors involved in the incomplete -or fractional-response of one cell line, due to intrinsic drug resistance of non-genetic origins (a phenomenon observed for all drugs at their IC 50 in all cell lines). Both single-cell genomics and single-cell drug response analyses [8] have revealed a range of heterogeneity within isogenic cell populations (within one "cell type") that has not been fully comprehended up to now, for technical reasons [1]. Therefore, the natural gene expression variability of isogenic cells may be referred to as gene expression noise only because of the actual deficiency in specific gene sets that define cell states such as drug-sensitive or drug-tolerant state (as opposed to a drug-induced state in the resistant cells), that could inform on genes likely to perturb the MoA of the drug of interest. Therefore, single-cell experimental methods to determine the MoA-perturbing genes remain critical to increase treatment efficacy (or reduce treatment toxicity).
Our approach utilizes predicted drug response knowledge from fate-seq to rank genes among MoA-perturbing gene signature and associate prior knowledge from protein-protein interaction networks to favor protein that are connected to the targeted signaling pathway, which may also reveal novel biological activities.

Future work
Our results suggest that combining same-cell functional pharmacogenomics screens such as fate-seq, with gene prioritization technique described here, are promising novel methods to improve gene definitions in GO with respect to their association to novel drug efficacy gene signatures, and help revealing the most effective co-targets for combination therapy.
From the theoretical standpoint, our gene prioritization strategy underscore several future challenges, two of which are of direct interest in biology and medicine. Given a pair of genes highlighted (one source in X, one target in P), the first challenge resides in the identification of sets of intermediate nodes accounting for the (high) hit score observed between these two nodes. Indeed, such intermediates could be used to delineate the biochemistry of interactions (enzymes, non covalent interactions, etc), paving the way to quantitative ODE based models involving reaction rates and/or affinity constants for (sub-) pathways. Specifically, quantitative approaches based on target controllable linear systems, which aim at driving an interaction network to a desired state based using a linear-ODE based system [51,52], could benefit from the interactions and paths unveiled by our models. The second challenge relates to the precise link between the progressive nature of interactions highlighted by our modified diffusion distances, and the hierarchical nature of interactions within complex networks. We indeed anticipate that our tools will prove useful to unveil certain aspects of multiscale interactome models.